For S3D and 4D meta-analysis see (K. Kim, et al. 2017).
In general, defining feature importance is a common task, though it is typically model dependent.
We’ll take inspiration from the scree plot and try to apply it to the LDA-like approach. Consider a scree plot of the flea data. This shows how much each component contributes to the full-sample, full-dimensionality, \([n,p]\) variation of the data.
The user study task tries to explore the full-sample, full-dimensionality \([n,p]\) separation of two specified clusters. In an analogous manner, let’s try to create a scree-plot-like output to evaluate the contributions of the original variables. Note that this is related to what R. Fisher attempts in his 1936 paper. Similarly, we start by finding cluster means and covariances.
Cluster means:
| Cluster | tars1 | tars2 | head | aede1 | aede2 | aede3 |
|---|---|---|---|---|---|---|
| Concinna | 183.1 | 129.6 | 51.24 | 146.2 | 14.10 | 104.9 |
| Heptapot. | 138.2 | 125.1 | 51.59 | 138.3 | 10.09 | 106.6 |
| Heikert. | 201.0 | 119.3 | 48.87 | 124.7 | 14.29 | 81.0 |
Cluster variance-covariance matrices for Concinna, Heptapot., and Heikert., respectively: [matrices not recovered]
Suppose the clusters in question are Concinna and Heptapot. The line between the cluster means of these groups is their difference. This is sufficient for Linear Discriminant Analysis (LDA), which assumes homogeneous variation between clusters. We’ll follow Fisher’s Discriminant Analysis, which accounts for within-cluster variance.
\[ \text{ClusterSeparation}_{[1,p]} = (\mu_{b[1,p]} - \mu_{a[1,p]})^2~/~(\Sigma_{a[p,p]} + \Sigma_{b[p,p]})~~~;~a,~b~\text{are clusters} \in X_{[n,p]}\]
| var | var_clSep | cumsum_clSep |
|---|---|---|
| aede1 | 0.48 | 0.48 |
| aede2 | 0.40 | 0.88 |
| tars2 | 0.38 | 1.27 |
| tars1 | 0.24 | 1.50 |
| head | 0.21 | 1.71 |
| aede3 | 0.15 | 1.86 |
We discard the sign, as we only care about the magnitude that each variable contributes to the separation of the specified clusters. We scale the absolute terms by the inverse of their sum. Now let’s visualize this in a manner similar to the scree plot.
Now that we have a measure, we want to define an objective cutoff for evaluation. We want the measure to have a few attributes:
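The steps described so far (cluster means, within-cluster covariances, discarding the sign, scaling by the sum) can be sketched in Python as below. This is only one possible reading of the formula: it assumes the division by \(\Sigma_a + \Sigma_b\) means solving against the summed covariance matrix, as in Fisher's discriminant direction. The function names and the toy clusters are hypothetical, and the sketch is not expected to reproduce the flea table.

```python
from statistics import mean


def covariance_matrix(rows):
    """Sample variance-covariance matrix (n - 1 denominator) of a list of rows."""
    n, p = len(rows), len(rows[0])
    mu = [mean(r[j] for r in rows) for j in range(p)]
    return [[sum((r[i] - mu[i]) * (r[j] - mu[j]) for r in rows) / (n - 1)
             for j in range(p)] for i in range(p)]


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def cluster_separation(cluster_a, cluster_b):
    """Scaled absolute contribution of each variable to the a-vs-b separation."""
    p = len(cluster_a[0])
    mu_a = [mean(r[j] for r in cluster_a) for j in range(p)]
    mu_b = [mean(r[j] for r in cluster_b) for j in range(p)]
    cov_a, cov_b = covariance_matrix(cluster_a), covariance_matrix(cluster_b)
    # Assumption: "divide by (Sigma_a + Sigma_b)" is read as a linear solve.
    summed = [[cov_a[i][j] + cov_b[i][j] for j in range(p)] for i in range(p)]
    diff = [mu_b[j] - mu_a[j] for j in range(p)]
    direction = solve(summed, diff)
    magnitude = [abs(w) for w in direction]   # discard the sign
    total = sum(magnitude)
    return [w / total for w in magnitude]     # scale by the inverse of the sum


# Hypothetical toy clusters: means differ mostly in the first variable.
cluster_a = [(0, 0), (2, 0), (0, 2), (2, 2)]
cluster_b = [(10, 1), (12, 1), (10, 3), (12, 3)]
separation = cluster_separation(cluster_a, cluster_b)
print([round(s, 3) for s in separation])
```

In this toy case the first variable carries most of the scaled separation, which is the kind of ranking the scree-plot-like display is meant to show.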
Following these, we define a measure to be:
\[\text{Marks} = \sum_{i = 1}^{p}\left(\sqrt{\text{ClusterSeparation}_i} - \frac{1}{p - 1}\right) \cdot I(\text{Response}_i)\]
Here, we add lines indicating the weight of each variable if selected as important. We then apply our measure to evaluate task responses. We review an example response below:
| variable | var_clSep | Weight | exampleResponse | Marks |
|---|---|---|---|---|
| aede1 | 0.48 | 0.50 | 1 | 0.50 |
| aede2 | 0.40 | 0.43 | 1 | 0.43 |
| tars2 | 0.38 | 0.42 | 0 | 0.00 |
| tars1 | 0.24 | 0.29 | 1 | 0.29 |
| head | 0.21 | 0.26 | 0 | 0.00 |
| aede3 | 0.15 | 0.19 | 1 | 0.19 |
Total Marks = 1.4
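The marks calculation above can be checked with a short Python sketch. It uses the rounded `var_clSep` values and the example response from the table, with \(p = 6\). The variable and function names are hypothetical, and small differences from the Weight column (for example 0.49 rather than 0.50 for aede1) are expected, because the table inputs are already rounded.

```python
import math

# Rounded values copied from the table above.
var_cl_sep = {"aede1": 0.48, "aede2": 0.40, "tars2": 0.38,
              "tars1": 0.24, "head": 0.21, "aede3": 0.15}
example_response = {"aede1": 1, "aede2": 1, "tars2": 0,
                    "tars1": 1, "head": 0, "aede3": 1}


def score_response(cl_sep, response):
    """Return per-variable weights and total marks for one response."""
    p = len(cl_sep)
    baseline = 1 / (p - 1)  # the 1 / (p - 1) term in the Marks formula
    weights = {v: math.sqrt(s) - baseline for v, s in cl_sep.items()}
    # I(Response_i) is 1 when the variable was selected, 0 otherwise.
    total = sum(weights[v] * response[v] for v in cl_sep)
    return weights, total


weights, total_marks = score_response(var_cl_sep, example_response)
for name, weight in weights.items():
    print(f"{name}: weight = {weight:.2f}, marks = {weight * example_response[name]:.2f}")
print(f"Total Marks = {total_marks:.1f}")
```

Note that a variable with a small enough separation would receive a negative weight under this formula, so selecting it would reduce the total.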
All linear projections are necessarily a lossy representation of the full data. By this we mean that no single 2D frame can show the whole set of information for a \(p \geq 3\)-dimensional sample. Any pair of principal components necessarily shows less than all the variation, namely the sum of their contributions, typically stated as a percentage of full sample variation. Analogously, no single projection can show the full information explaining the cluster separation of 2 given clusters.
In application, a PC1 by PC2 biplot of the flea data contains 94.72 percent of the variation in the sample, while viewing (an orthogonal projection of) the top 2 variables (namely: aede1, aede2) explains 88.15 percent of the within-sample cluster separation between Concinna and Heptapot.
In order to stress test this cluster separation viewed as a scree plot, we apply it to other toy datasets.
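As a reminder of where a figure like the PC1 by PC2 percentage comes from, the share of variation shown by the first two principal components is the standard eigenvalue ratio below. The notation is introduced here and is not in the original: \(\lambda_j\) is the \(j\)-th largest eigenvalue of the covariance (or correlation) matrix used for the PCA. The second line is an assumed analogue for the cluster separation, where \(s_{(j)}\) is the \(j\)-th largest absolute per-variable contribution.

\[
\text{Variation shown by PC1, PC2} = \frac{\lambda_1 + \lambda_2}{\sum_{j=1}^{p} \lambda_j}
\qquad
\text{Separation shown by top 2 variables} = \frac{s_{(1)} + s_{(2)}}{\sum_{j=1}^{p} s_{(j)}}
\]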
(invalid assumptions, as there are 3 species clusters for each sex)

### Penguins, between levels of sex with 1 species

Can we simulate the cluster separation that we expect? Let’s create a simulation that has variable contributions for the following cases:
Observe how changing the variance-covariances changes cluster separation, given that cluster means differ as 80, 20, rep(0) (the signal from the means is large relative to the variance):

1. 2 variables
2. 5 variables
3. 5 variables, within each cluster V1-V2 covariance set to .3
4. 5 variables, Cluster 1 covariance: all off diagonal set to .7, diagonals set to 5. Cluster 1 covariance: diag(5)
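One of the cases listed above (5 variables, V1-V2 covariance of .3 within each cluster) could be simulated along the following lines. This is a rough Python sketch under stated assumptions rather than the simulation actually used: unit variances, 100 observations per cluster, the seed, and all function names are hypothetical choices. Multivariate normal draws are built from a Cholesky factor so that only the standard library is needed.

```python
import math
import random


def cholesky(cov):
    """Lower-triangular factor L with L L^T = cov (cov must be positive definite)."""
    n = len(cov)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(cov[i][i] - s)
            else:
                lower[i][j] = (cov[i][j] - s) / lower[j][j]
    return lower


def simulate_cluster(mean, cov, n, rng):
    """Draw n rows from a multivariate normal with the given mean and covariance."""
    lower = cholesky(cov)
    p = len(mean)
    rows = []
    for _ in range(n):
        z = [rng.gauss(0, 1) for _ in range(p)]
        rows.append([mean[i] + sum(lower[i][k] * z[k] for k in range(i + 1))
                     for i in range(p)])
    return rows


def case_covariance(p, v1_v2_cov=0.0):
    """Identity covariance (assumed unit variances) with an optional V1-V2 covariance."""
    cov = [[1.0 if i == j else 0.0 for j in range(p)] for i in range(p)]
    cov[0][1] = cov[1][0] = v1_v2_cov
    return cov


rng = random.Random(2020)                 # fixed seed for reproducibility
p = 5
mean_a = [0.0] * p
mean_b = [80.0, 20.0] + [0.0] * (p - 2)   # means differ as 80, 20, rep(0)
cov = case_covariance(p, v1_v2_cov=0.3)   # case 3 in the list above
cluster_a = simulate_cluster(mean_a, cov, 100, rng)
cluster_b = simulate_cluster(mean_b, cov, 100, rng)
```

The other cases follow by changing `p` and the covariance matrix passed to `simulate_cluster`.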
In order to properly distinguish a difference between the 3 visualization factors, the data must be of suitable complexity, such that it has the following properties:
Let’s try to evaluate our current generation of data simulations against these properties.
This was a 300 series simulation done at the end of the generation 1 user study shiny app.
The data seem sufficiently complex that the separation is not visible in any pair of the first 4 principal components. Now to see if we can see anything in radial tours of all variables. We’ll view the cluster separation to explore which variables should contain contributions.
Fisher, Ronald A. “The Use of Multiple Measurements in Taxonomic Problems.” Annals of Eugenics 7, no. 2 (September 1936): 179-88. https://doi.org/10.1111/j.1469-1809.1936.tb02137.x.